Introduction

This is an data analysis of some important KPIs in Cerro Matoso, a Colombian Nickel Mine. The data set has 11 variables and 10006 observation, that are operational variables over a 1 year of records in 2 shift, day and night. The equipment are truck CAT777 operated by humans.

Analysis

Data

Cerro_matoso_1 dataset

The data set consists in 11 variables and 10006 observations. following the dictionary is listed.

  • Fecha: A date object that goes from 2017-07-01 to 2018-06-24
  • Turno: It contains a categorical variable 2 factor(‘dia’,‘noche’).
  • Operativo: A numerical Variable that comes from 0 Hours to 12 hour(a shift) per truck. THis is the real time where a truck or Equipo moves raw material over the mine. it’s like the effective time of operation.
  • Equipo:A categorical variable with 14 factor; Every mine’s dumb truck CAT777 of 100 Ton. From M3-32 to M3-45.
  • Mant Preventiva: Is a numerical variable that comes from 0 to 12 hours(shif) per truck. A truck can be down by maintenance for 12 hours per shift. it is related to time of preventive maintenance.
  • Mant Planeada: Is a numerica variable that comes from 0 to 12 hour(shift) per truck. Likewise, a truck can be down by 12 hours. this is maintenance.
  • Mant correctiva: Is a numerical variable and it comes down to another maintenance time.
  • over haul: Have you ever watch OVERHaulin’ in TV, Something like that. when a truck is in a long maintanence stage.
  • Demora prog: Is a numerical variable that it is related to non-operational and non-maintenance time. it’s when the operator take a lunch, change shift or wait for fuel supply.
  • Demora No prog: Is a numerical variable and is related to another non-operational and non-maintenance time. it’s aside of Programada programada it can not be planned early, for instance, a brast or get stuck in the ground when it is raining even when it raining.
  • DemoraStanBy: When a truck is no longer needed. it can be put in standby.
knitr::kable(cerro_matoso_1[1:5, ])
fecha Turno Operativo Equipo Mant Preventiva Mant Planeada Mant Correctiva Over Haul Demora Prog Demora Noprog DemoraStandBy
2017-07-01 Dia 0.000000 M3-32 0 12.000000 0 0 0.000000 0.000000 0
2017-07-01 Dia 10.002222 M3-33 0 0.000000 0 0 1.450555 0.547222 0
2017-07-01 Dia 1.150555 M3-34 0 10.468888 0 0 0.380555 0.000000 0
2017-07-01 Dia 10.035277 M3-35 0 0.000000 0 0 1.964722 0.000000 0
2017-07-01 Dia 7.686111 M3-37 0 1.713055 0 0 1.607222 0.993611 0

cerro_matoso_group dataset

This is a wrangling of the cerro_matoso_1. You may see the operation in following script.

 cerro_matoso_group<-cerro_matoso_1%>%
  group_by(fecha, Turno)%>%summarize(
    Operativo=sum(Operativo),
    preventiva=sum(`Mant Preventiva`),
    planeada=sum(`Mant Planeada`),
    correctiva=sum(`Mant Correctiva`),
    overall=sum(`Over Haul`),
    programada=sum(`Demora Prog `),
    noprogramada=sum(`Demora Noprog `),
    stanby=sum(DemoraStandBy))%>%
  mutate(Tcalendario=(Operativo+preventiva+planeada+correctiva+overall+programada+noprogramada),
    disponibilidad=(Tcalendario-preventiva-correctiva-planeada-overall)/Tcalendario,
    utilizacion=Operativo/(Tcalendario-preventiva-correctiva-planeada-overall))
knitr::kable(cerro_matoso_group[1:5, ])
fecha Turno Operativo preventiva planeada correctiva overall programada noprogramada stanby Tcalendario disponibilidad utilizacion
2017-07-01 Dia 96.93527 12 24.181943 0.000000 0.000000 17.64361 5.239165 0 156 0.7680644 0.8090207
2017-07-01 Noche 89.79277 12 12.000000 0.000000 0.000000 24.04666 18.160552 0 156 0.8461538 0.6802484
2017-07-02 Dia 87.95666 12 22.760833 0.341111 0.000000 19.84250 13.098888 0 156 0.7749875 0.7275276
2017-07-02 Noche 43.32527 12 8.532222 10.242222 0.969722 80.93055 0.000000 0 156 0.7965117 0.3486780
2017-07-03 Dia 88.46916 12 4.510833 3.401666 11.566944 20.58777 15.463608 0 156 0.7982087 0.7104784

In summary it turns out that is needed summarized how much time have each times taken group by date an shift of the all truck. Moreover, 3 variable are mutated to a dataframe, the first Tcalendario is the sum of;

\[preventiva + planeada + correctiva + overall +programada + noprogramada\]

disponibilidad that is:

\[\%disponibilidad = \frac{Tcalendario-preventiva-correctiva-planeada-overall}{Tcalendario}\]

As you can see, disponibilidad is a rate of percentange which assesses the equipment and maintenance group performance.

Other variable added is utilization that is

\[\%utilizacion =\frac{operativo}{Tcalendario-preventiva-planeada-correctiva-overall}\].

Like disponibilidad, utilizacion a rate of how effective the operation use the equipments for moving material over the time delived by maintenance.

Cerro_matoso_equipo dataset.

cerro_matoso_equipo <- cerro_matoso_1%>%group_by(fecha,Equipo)%>%
  summarize(
    Operativo=sum(Operativo),
    preventiva=sum(`Mant Preventiva`),
    planeada=sum(`Mant Planeada`),
    correctiva=sum(`Mant Correctiva`),
    overall=sum(`Over Haul`),
    programada=sum(`Demora Prog `),
    noprogramada=sum(`Demora Noprog `),
    stanby=sum(DemoraStandBy))%>%
  mutate(
    Tcalendario=(Operativo+preventiva+planeada+correctiva+overall+programada+noprogramada),
    disponibilidad=(Tcalendario-preventiva-correctiva-planeada-overall)/Tcalendario,
    utilizacion=Operativo/(Tcalendario-preventiva-correctiva-planeada-overall))   


knitr::kable(cerro_matoso_equipo[1:5, ])
fecha Equipo Operativo preventiva planeada correctiva overall programada noprogramada stanby Tcalendario disponibilidad utilizacion
2017-07-01 M3-32 0.000000 0 24.000000 0 0 0.000000 0.000000 0 24 0.0000000 NaN
2017-07-01 M3-33 12.210555 0 0.000000 0 0 1.896943 9.892499 0 24 1.0000000 0.5087732
2017-07-01 M3-34 9.428055 0 10.468888 0 0 3.814721 0.288333 0 24 0.5637963 0.6967688
2017-07-01 M3-35 19.183054 0 0.000000 0 0 4.816944 0.000000 0 24 1.0000000 0.7992940
2017-07-01 M3-37 17.124722 0 1.713055 0 0 4.168610 0.993611 0 24 0.9286227 0.7683746
###cerro_mato so_gather dataset.
cerro_matoso_gather<-cerro_matoso_1%>%
  group_by(fecha)%>%
  summarize(
    Operativo=sum(Operativo),
    preventiva=sum(`Mant Preventiva`),
    planeada=sum(`Mant Planeada`),
    correctiva=sum(`Mant Correctiva`),
    overall=sum(`Over Haul`),
    programada=sum(`Demora Prog `),
    noprogramada=sum(`Demora Noprog `),
    stanby=sum(DemoraStandBy))%>%
  mutate(
    Tcalendario=(Operativo+preventiva+planeada+correctiva+overall+programada+noprogramada),
    disponibilidad=(Tcalendario-preventiva-correctiva-planeada-overall)/Tcalendario,
    utilizacion=Operativo/(Tcalendario-preventiva-correctiva-planeada-overall))%>%
  select(fecha,disponibilidad, utilizacion) %>%gather(key = categoria,value=value,-fecha)





knitr::kable(cerro_matoso_gather[1:5, ])
fecha categoria value
2017-07-01 disponibilidad 0.8071091
2017-07-02 disponibilidad 0.7857496
2017-07-03 disponibilidad 0.8595192
2017-07-04 disponibilidad 0.8090705
2017-07-05 disponibilidad 0.8245157

plots

Ones of the most important mine KPIs are disponibilidad and utilizacion. Both are often taken as part of bussiness cases in which are those features to being optimized. Nevertheless, previously, before this essay both were taken as individual variables without any correlation to each other. What it is false. The projects were bogus due to just make effort on reducing one of them without see the behavior of the another. what it leds to mistakes and non-appoined result. So it is neeeded whether a correlation exists. To see so, it may be carried out a correlation plot or a inferential analysis(F statistic, T student). However for being more sharp, we’d rather liked a Simple linear regression, where if the changes onn utilization are due to chance and there’s no relation with disponibilidad or there’s is relation.

We can frame this investigation into statistical hyphothesis test

H0: Slope=0, The true linear model has slope and there’s no relation between disponibilidad and utilization

HA: slope =!0, the true linear model has a slope different to zero, and utilization’s change is related to disponibilidad.

Utilizacion vs Disponibilidad

##----------lays out utilization vs disponibilidad coloured by shift.
##----------furthermore,method  simple linear regresion line for more interpretability.
ggplot(cerro_matoso_group, aes(x=disponibilidad, y=utilizacion, col=Turno))+
  geom_point()+geom_smooth(se = F,method = lm)

As it can seen graphically, there’s non-zero slope what it means the change on utilization can be explain with changes in disponibilidad, thereby aren’t separeted features. To do a stronger proof. Linear model summary can show a better approach.

Linear simple regression.

lm(utilizacion~disponibilidad, data=cerro_matoso_group)%>%
  summary()
## 
## Call:
## lm(formula = utilizacion ~ disponibilidad, data = cerro_matoso_group)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.66726 -0.06965  0.04028  0.12112  0.24789 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     1.15122    0.05301  21.717  < 2e-16 ***
## disponibilidad -0.55505    0.06850  -8.103 2.32e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1695 on 716 degrees of freedom
## Multiple R-squared:  0.08399,    Adjusted R-squared:  0.08271 
## F-statistic: 65.65 on 1 and 716 DF,  p-value: 2.323e-15

Here can be seen:

  • p-value of disponibilidad slope is nearly zero, whereby we successfully reject the null hyphotesis we initially put above.
  • only 8% of utilizacion is explained by disponibilidad(R-squared)
  • the slope is negative. which it means that when disponibilidad increases less utilization can be performanced by the mine operation.

The main reason whereby when disponibilidad increases and utilization decreses is because more equipment are available and the same people available,something like lack of staff.The opposite occurs when the disponibilidad decreases, the utilizacion increase unexpectedly.Then engineers can say that their outstandings managements over the fleets are giving results.

look at when the disponibilidad goes downward the utilizacion goes upward

The previous conclusion opens up widely window where we can see the correlation among the perfomance KPIs let’s do it!

correlation plot.

Many hidden correlations suddenly pop up. Yes there’s negative correlation between utilizacion and disponibilidad, isn’t strong but it exists.Following conclusions can be taken from the plot

First of all, the correlation above is just a pearson correlation that doesn’t include categorical variable like Turno that can incoporate different aproaches. It may turn out that simpson’s paradox comes in and the same behavior doesn’t occur anymore.

  • It looks like Tcalendario has the strongest correlation with Standby. It turns out that when the Tcalendario decrease is because a truck or some truck have been put on standby.
  • Operativo is highly affected by utilizacion, what it can be seen as pros or as cons. We could say " let’s increase utilizacion and then we would have better operativo time" nevertheless a increasing in operativo time not only depend on utilization and utilization doesn’t depend on itself. Cofounding can occur.
  • Going forward an obvious correlation exists between utilizacion and programada and indeed a strong correlation between programada and operativo. whether we want to carry out a multiple linear regression, both as much as programada and utilizacion are collinear, we should drop out one of them.
  • Forget simpson’s paradox for a while; standby and operativo have a unexpected correlation. why is there a correlation between two features that seems not to be linked at all?

A scatter plot can show why.

As it has been shown, simpson’s paradox displays other a clue about the effect standy is able to do when Turno pops up.

ggplot(cerro_matoso_group, aes(x=stanby, y=utilizacion,col=Turno))+
  geom_point( )+geom_smooth(se = F,method = lm)

Look at this! it seems like utilizacion is influenced by Turno when it is night.So there might be increments in utilizacion due to standby. So the advice only apply at night, so engineer, let’s check closer why does utilizacion increase.

Truck Performance.

Once utilizacion has been checked, the original dataset give us more data and more analysis do with. So let’s see!

When we work with groups of data, we want to see a regular behavioir inside the group and between groups. In this case, trucks give operativo time for moving and dumbing raw material, and we need to figure out if each truck operativo time is within the variable’s natural variabiity.An ANOVA can asses whether there are different or not among the trucks and if so, which are below of natural performance and why?.

aov(Operativo~Equipo, data=cerro_matoso_equipo)%>%summary()
##               Df Sum Sq Mean Sq F value Pr(>F)    
## Equipo        13  23408  1800.6   38.14 <2e-16 ***
## Residuals   4989 235548    47.2                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  • p-value less than 0.05 means the trucks have difference among the operativo time that they give every Turno(shift).

Let’s check which trucks are below

d=pairwise.t.test(cerro_matoso_equipo$Operativo,cerro_matoso_equipo$Equipo, alternative = "two.sided")
d$p.value
##              M3-32        M3-33        M3-34        M3-35        M3-36
## M3-33 9.567019e-02           NA           NA           NA           NA
## M3-34 4.476077e-09 1.678001e-02           NA           NA           NA
## M3-35 5.497719e-03 6.165131e-10 5.012575e-23           NA           NA
## M3-36 2.252164e-09 9.585101e-03 1.000000e+00 3.277560e-23           NA
## M3-37 3.928567e-07 1.582986e-01 1.000000e+00 6.299672e-20 1.000000e+00
## M3-38 9.421295e-05 6.744131e-13 2.599119e-27 1.000000e+00 1.919929e-27
## M3-39 2.056486e-04 2.440391e-12 1.611487e-26 1.000000e+00 1.173195e-26
## M3-40 3.975589e-02 1.000000e+00 4.365755e-02 8.196918e-11 2.636163e-02
## M3-41 4.942497e-10 4.891354e-03 1.000000e+00 1.605858e-24 1.000000e+00
## M3-42 8.761861e-04 1.000000e+00 6.604929e-01 3.853981e-14 4.411076e-01
## M3-43 8.074151e-07 2.271800e-01 1.000000e+00 2.119518e-19 1.000000e+00
## M3-44 1.000000e+00 1.108009e-01 6.303416e-09 4.600089e-03 3.171901e-09
## M3-45 1.230655e-08 2.884238e-02 1.000000e+00 2.479742e-22 1.000000e+00
##              M3-37        M3-38        M3-39      M3-40        M3-41
## M3-33           NA           NA           NA         NA           NA
## M3-34           NA           NA           NA         NA           NA
## M3-35           NA           NA           NA         NA           NA
## M3-36           NA           NA           NA         NA           NA
## M3-37           NA           NA           NA         NA           NA
## M3-38 5.938097e-24           NA           NA         NA           NA
## M3-39 3.280270e-23 1.000000e+00           NA         NA           NA
## M3-40 3.403895e-01 6.909860e-14 2.620584e-13         NA           NA
## M3-41 1.000000e+00 6.050165e-29 4.046520e-28 0.01442720           NA
## M3-42 1.000000e+00 1.286961e-17 5.735932e-17 1.00000000 3.034371e-01
## M3-43 1.000000e+00 2.238479e-23 1.195735e-22 0.45370151 1.000000e+00
## M3-44 5.321288e-07 7.325583e-05 1.614330e-04 0.04560703 6.981636e-10
## M3-45 1.000000e+00 1.480952e-26 9.057542e-26 0.07146187 1.000000e+00
##             M3-42        M3-43        M3-44
## M3-33          NA           NA           NA
## M3-34          NA           NA           NA
## M3-35          NA           NA           NA
## M3-36          NA           NA           NA
## M3-37          NA           NA           NA
## M3-38          NA           NA           NA
## M3-39          NA           NA           NA
## M3-40          NA           NA           NA
## M3-41          NA           NA           NA
## M3-42          NA           NA           NA
## M3-43 1.000000000           NA           NA
## M3-44 0.001094003 1.085603e-06           NA
## M3-45 0.948545202 1.000000e+00 1.717715e-08

This is insane!, none has time to see this, this isn’t data analysis. Let’s see visually.

##---------------------------lays out  a boxplot as a proof of different operativo's time.
g=ggplot(cerro_matoso_equipo,aes(x=Equipo,y=Operativo, col=Equipo))+geom_boxplot()+ylab('Operativo(hours)')
ggplotly(g)

The total time given by every truck within a day(24hours)

As you have noticed visually, there almost 4 truck that seem to be underneath of the median.